Detecting novel cell type in single-cell chromatin accessibility data via open-set domain adaptation

Abstract Recent advances in single-cell technologies enable the rapid growth of multi-omics data. Cell type annotation is one common task in analyzing single-cell data. It is a challenge that some cell types in the testing set are not present in the training set (i.e. unknown cell types). Most scATAC-seq cell type annotation methods generally assign each cell in the testing set to one known type in the training set but neglect unknown cell types. Here, we present OVAAnno, an automatic cell types annotation method which utilizes open-set domain adaptation to detect unknown cell types in scATAC-seq data. Comprehensive experiments show that OVAAnno successfully identifies known and unknown cell types. Further experiments demonstrate that OVAAnno also performs well on scRNA-seq data. Our codes are available online at https://github.com/lisaber/OVAAnno/tree/master.


Introduction
Recent single cell sequencing technologies have provided the opportunities to facilitate the analysis of cells in complex tissues at high resolution.To analyze these data, one important task is cell types annotation [1,2].The general step of cell annotation is preprocessing data, removing batch effects, clustering cells through unsupervised clustering, finding gene markers, and finally assigning a cell type to each cluster based on markers.However, as the amount of data grows rapidly, so do the time and labor cost of this annotation procedure [3].The set of markers may be different by using different methods, leading to inconsistance in the annotation results [4].These problems motivate the development of automatic cell type annotation methods which annotate target data using annotated (i.e.labeled) source data.
Most of existing automatic cell type annotation methods focus on scRNA-seq(single-cell RNA) data, such as TOSICA [4], scmap [5], and scibet [6].Some methods applied to scRNA-seq data can be directly applied to scATAC-seq (single-cell Assay for Transposase Accessible Chromatin) data, such as LIGER [7], scMC [8], and Seurat v3 [9].However, the overall performance of direct application is not good, especially for methods that require converting the peakto-cell matrix into gene-to-cell matrix (i.e. gene activity matrix), which may result in the loss of some biological information [10].This suggests that an automatic cell type annotation model tailored to scATAC-seq data is necessary.
There are some cell type annotation methods specially designed for scATAC-seq data.Signac [11] is an end-to-end tool that enables complete analysis of scATAC-seq data, including cell type annotation and integration.Cellcano [12] firstly trains an Multilayer Perceptron (MLP) which can assign the pesudo label to the test data, and then trains a self-knowledge distiller model [13] on test data with high confidence.However, the above methods cannot predict unknown cell types; that is the cell types that exist in test data but not in training data.To the best of our konwledge, EpiAnno is the only method that can predict unknown cell types, which combines probabilistic generative model and Bayesian neural network [14].EpiAnno constructs a multivariate Gaussian distribution for each cell type, from which embedding is sampled.Then, the embedding is reconstructed to peak-to-cell matrix.This whole process is much like Variational AutoEncoder (VAE) [15].The weights of the model follow the set prior distribution, and in the phase of inference, the model uses its inverse function for cell type prediction by setting a fixed threshold to distinguish between known and unknown cell types.However, this strategy of fixed threshold may become problematic when batch effects vary among different batches.
There are some challenges for cell type annotation in scATACseq data: (i) Batch effects: differences in data distribution between source and target datasets (i.e.batches) due to technical inf luences during the experiment [16,17].(ii) The captured accessibility area can vary significantly between batches.(iii) Compared with scRNA-seq data, scATAC-seq data have higher sparsity and dimensionality, with nearly binarized values [18].(iv) The number of cell types may vary between different batches.
Domain adaptation is a technique to improve the performance of a model on a target domain containing insufficient annotated data by using the knowledge learned by the model from a different but related source domain with adequate labeled data.The domains can be different batches in the context of singlecell data.Recently, some methods based on domain adaptation have been proposed to solve batch effects in single-cell data [19].Zhou et al. [20] used pseudolabeling to minimize the class center between the training and test data for cell type annotation.Jialu et al.combined variational auto-encoder and adversarial learning strategy for batch correction and dimension reduction [21].Koff et al. integrated scATAC-seq data in a similar way [22].Yingxin et al. added variations related to confounding factors to represent batch effect and reduced its impact on the latent representation related to intrinsic cell states [23].However, methods above generally assume that the training dataset and the test dataset have the same cell type, which can not be applied to the test set with novel cell type.
Open-set domain adaptation assumes that test set contains not only all the types in the training set but also unknown type that does not exist in training set [24].Busto et al. assumed that there is also unknown type in the training set and used it to identify unknown type in the test set [24].Through adversarial training strategy, Saito et al. proposed OSBP, which draws a boundary between training set and test set and treats it as the threshold between known and unknown types [25].On the basis of OSBP, Liu et al. added multiple binary classifiers to progressively separate known and unknown samples [26].Zhang et al. assigned weights to each sample in adversarial learning and ignored samples with smaller weights in domain alignment [27].Sifan et al. selected samples with high confidence and aligned known samples by weighted adversarial learning [28].These open-set domain adaptation methods are proposed for computer vision, while there are no similar methods for cell type annotations.
Therefore, we propose One-vs-All Annotation(OVAAnno), a deep learning model which can annotate cell type and detect novel cell types with open-set domain adaptation.The embedding with biological information extracted by our method can be applied in downstream analysis, such as discovery of cell typespecific motifs and inference of transcription factor regulatory networks.

Methodology
OVAAnno consists of three parts: VAE, a closed-set classifier and an open-set classifier including multiple binary classifiers (Fig. 1).Through the VAE, we can obtain an information-rich embedding.Mapping the embedding to known cell types, the closed-set classifier selects the nearest known cell type for one sample (i.e. one cell).With adversarial training strategy, the open-set classifier identifies whether the sample belongs to known or unknown cell types.

Data preprocessing
Following the EpiAnno, for the scATAC datasets fed to model, when the peaks of the training set and those of the test set do not coincide, we map reads of training set to peaks of test set [14].Then, we binarize the peak matrices and only reserve the peaks that have at least one read count in at least 0.1% of cells.Finally, we select 20 000 highly variable peaks as features.The value in processed peak matrix indicates whether chromatin region is open or not(value 1 for open and otherwise).For the scRNA dataset, after filtering the genes that express in less than 0.1% of cells, we normalize the dataset by normalization and log function in scanpy with a size factor of 10 000.Finally, we select 3000 highly variable genes as features.

VAE and closed-set classifier
Here, the processed peak matrix X ∈ R N•m has N cells and m peaks (source: ), and each cell in source data has corresponding cell types Following the Kingma and Welling [15], we encode the data to the L-dimensional mean μ and variance σ parameters by encoder and take the sample Z ∈ R N•L from the approximate posterior distribution.Then, it is reconstructed as X ∈ R N•m by decoder.With sigmoid function, the output of decoder can be considered as the probability whether the region is open or not.Using the KL divergence(Equation 1), we approximate the distribution of Z to the Gaussian distributions, while using multivariable Bernoulli distribution to model the data by binary cross entropy loss(L recon ).For the scRNA data, we use mean square error loss as reconstruction loss.The source and target data share the same encoder and decoder.This is an effective strategy for solving domain adaptation problem [29].Because VAE is a generative model, the sample Z can be considered as the embedding of a new sample, which can be considered as data augmentation.Meanwhile, Z s from the source data are fed into the closed-set classifier with the optimization goal of cross entropy loss(L sup_c ). (1)

Open-set classifier on source data
Open-set classifier consists of K binary classifiers.For binary classifier j, it determines whether the sample is C j or not, and samples not belong to C j are negative samples.For each classifier, the number of negative samples may be larger than the number of positive samples.As negative samples include multiple cell types, the number of samples for each cell type in negative samples may vary greatly [30].To solve class-imbalance problem, motivated by [31], we define the open-set classifier loss on source data as Here, y i denotes the true cell type of x s i .In Equation 2, the higher pij , the higher the weight of the sample.High pij only occurs if positive samples has low probabilities or negative samples has high probabilities.That means each binary classifier pays more attention to hard-to-distinguish samples and ignores easyto-distinguish samples.In this way, the boundary lies between the known cell type and its nearest unknown cell types, which can minimize the risk of misaccepting unknown cell types as its near known cell types.To further alleviate the bias of open-set classifier, we also use α ij to reweight the positive sample loss by the ratio of negative and positive sample sizes.

Open-set classifier on target data
In a previous study [25], to minimize batch effects between source data and target data, OSBP's classifier sets a probability value boundary, which we denote it as P b .With adversarial training strategy, the training goal of OSBP's classifier is to make the probability value approximate P b , while the training goal of the generator is the opposite.The generator of OSBP decreases the probability for a known class but increases the probability for an unknown class.Finally, the differences of distribution between source known class and target known class are eliminated and unknown class is pushed away from known class.However, OSBP sets P b as a fixed value 0.5 which may not be suitable for each datasets for the reason that the difference in distribution between different datasets may be different.More importantly, the probability of unknown class that close to known class may fall near 0.5 and be greater than 0.5.As training proceeds, the probability of the unknown class may become higher and higher, eventually with unknown class identified as known class.
Our idea is to set a customized value boundary for each binary classifier separately and the vector P B consisting of all value boundary is adjusted automatically with the training process.Unlike OSBP, we add Gradient Reversal layer [32] after open-set classifier to reverse the gradient, so that the probabilities of target samples are pushed away from P B .Since the training of each binary classifier is similar, here, we use the jth classifier cls j which identifies whether target samples are cell type C j or not, as an example for the purpose of illustration.
As shown on the bottom right side of Fig. 1, we obtain the nearest known cell types C j of one target sample i by closed-set classifier, and this sample will be assigned to C j if the corresponding probability P t ji of C j from the open-set classifier is greater than 0.5, otherwise this sample will be assigned to unknown cell types.Through this process, we can obtain pseudo label Ŷt (i.e. one specific cell type) for each target sample.The vector P t j , each element of which is the probability of each target sample assigned to C j , is sorted in ascending order and the number at the top 5% interval of P t j is considered as P B j .With the removal of the batch effects, both P t j and P B j increase in training process.For one sample i not belong to C j , its P t ji may be higher than P B j , but generally lower than those of samples belong to C j .As P B j continues to rise, samples not belong to C j are gradually excluded from C j .For each mini batch, after training, Ŷt and P B in that mini batch are updated. (5) We also define weight β on target data which can help model to pay more attention to the target samples whose probability values are near the P B j , ignoring a large number of samples irrelevant to C j .Because class-imbalance may exist in the target data, the model may predict target samples belong to known cell types as unknown cell types without our reweighting strategy.Because P B j may become higher and higher with training, the weights of samples assigned unknown cell types may increase.We set the upper limit of weight for sample whose probability is below P B j and above 1 − P B j .The training algorigthm is shown in Algorithm 1.

Algorithm 1 Full algorithm
Input: (X s , Y s ), X t Output: Ŷt ,P t
For these methods, we utilized their default hyperparameters.
The preprocessing steps, such as selection of feature numbers, were uniformly executed utilizing the source code of each respective method unless stated otherwise.Following the benchmark method of Cellcano, we use scATAC-seq gene score matrix instead of the scATAC-seq peak matrix for all scATAC-seq methods except EpiAnno and Signac [12].The training process of these methods is consistent with that described in original paper.For the methods than cannot directly predict unknown cell types, including scvi and Signac, we set the thresholds as the values on which the methods can perform best on all tested datasets, 0.9 for scvi and signac.For all methods except Cellcano, cells with output confidence under the threshold are assigned as unknown cell types.For Cellcano which outputs entropy at the second phase, we sort the cells descendingly by entropy and assign the top K cells to unknown cell types, where K is the number of unknown cells in the test dataset.

Experiment
We collect four scATAC datasets (Table 1) for experiment.Since a dataset may contain multiple batches, we use each batch as a training or test set.Since the cell types in the Cusanovic dataset contain cell types from the other three datasets, we use Cusanovic dataset as test set and the other datasets as training set.To examine the robustness of our method, we use Cusanovic as training set and Forebrain as test set.We first remove cell types in Cusanovic but not in Forebrain, and then randomly remove one cell type in Cusanovic.In this way, the number of cell types in Forebrain is higher than that in Cusanovic.For 10X mouse brain dataset, Cusanovich dataset and Fang dataset, Luecken et al. used FASTQ files to reprocess them and unify their peak [10], which can be used directly for our experiment.The training configs are shown in Tables S1 and S2.

Implementation and evaluation metric
For scATAC-seq experiments, encoder is a neural network including four layers (3200-1600-800-400 Two evaluation metrics, effective assignment score (EAS) and effective multiple assignment score (EAS_M), are used in our experiments.We define N k as the number of test samples that belong to known cell types, N u as the number of test samples that belong to unknown cell types, N k k as the number of samples in N k predicted to be any of the known cell types, and N k u as the number of samples in N u predicted to be any of the known cell types.In a previous study [14], EAS is defined as Equation 7.However, EAS ref lects only whether a sample is assigned to known cell types or not, but not whether a sample is assigned to the corresponding correct cell type or not.We thus propose EAS_M(Equation 8).In EAS_M, N true k represent the number of samples in N k predicted to be their true cell types.For example, training set has two cell types A and B. Test set has three cell types A, B, and C. In EAS, it is correct that cells of cell type A are predicted as cell type A or B. But in EAS_M, the prediction is correct only when cells of cell type A are predicted as cell type A. EAS_M is a stricter metric than EAS.

EAS
We used Cohen's kappa, which is a minor-class sensitive approach thus can give us a comprehensive evaluation of classification performance, including the major types identification and the minor types identification.This metric is based on the confusion matrix, and its equation is where A ij represents the element in the ith row and jth column of the confusion matrix A, with the total number of samples being N.

Classification performance
As shown in Table 3, OVAAnno outperforms all benchmark methods on most datasets.Although OVAAnno and other methods get a lower score in EAS_M than EAS, OVAAnno shows more similar performance between EAS_M and EAS compared with EpiAnno.This shows that OVAAnno not only distinguishes known cell types from unknown cell types more accurately but also accurately predicts samples as corresponding known cell types.
When we use 10X or six batchs in Fang dataset as training sets, respectively(Table 3), OVAAnno still performs better than other methods in most datasets.EpiAnno's performance drops dramatically.This may be due to the large batch effects between training set and test set.We also combined each batch from Fang and 10X as the training set (Table S3).The results show that even though the training data with batch effects generated by different sequencing technologies, OVAAnno still performs better than other methods in general.Note that the number of cells of unknown cell types of Pre batch is only six, and identifying these rare unknown cell types is a challenged task for most methods.Signac performs well on Pre.This is because, at very high thresholds, Signac filters out almost all unknown types, resulting in relatively few cells of known types being predicted as unknown.Although Signac can perform best on this task with our chosen optimal threshold, it lacks robustness on other datasets and it is impractical to manually choose optimal threshold for every dataset.
As shown in Figs S1a and S2a, some cells of the same cell type are not clustered together because of batch effect.Compared with Cellcano (Fig. S1c) and Signac (Fig. S1e), OVAAnno can better separate unknown cell type with other cell types (Fig. S1b).EpiAnno appears to perform well in both integration and clustering.However, compared with ground truth (Fig. S1a), Epi-Anno predicts too many unknown types as known types (Fig. S1d), which may be due to negative transfer.An interesting observation for Signac output is that cells belong to excitatory neurons cell type are not clustered together (Fig. S1e), and a possible explanation is that multiple subtypes of excitatory neurons exist in the cortex.
On most datasets, OVAAnno performs better on the Kappa metric compared with other methods.Meanwhile, Cellcano significantly outperforms other methods on Pre, suggesting that it shows promise in predicting rare cell types.

Ablation experiment
In this experiment, using Fore batch, FA batch as training sets, respectively, and Cel batch, Pre batch as test sets, respectively, we evaluate the inf luence of L sup_o and P B on OVAAnno(Table 4).To evaluate the inf luence of L sup_o , we replace it with traditional binary cross entropy for each binary classifier in open-set classifier.Without L sup_o , it is hard for OVAAnno to draw a clear boundary between known cell types and unknown cell types for each binary classifier.As more samples of unknown cell types may have probability values higher than P B , OVAAnno w/o L sup_o tends to identify samples of unknown cell types as known cell types.From Table 4, OVAAnno's performance declines when L sup_o is replaced.
To demonstrate the effectiveness of adjusted P B during training, we replace it with fixed value 0.5.Without adjusted P B , the performance of OVAAnno drops drastically, as samples of unknown cell types are more likely to be misassigned to known cell types.It is possible for OVAAnno to filter the unknown cell types that close to known cell types by adjusted P B .
To demonstrate the effectiveness of weight β, we remove it during training.As shown in Table 4, OVAAnno's performance drops dramatically.For open-set binary classifier which identifies whether target samples are cell type C j or not, other cell types are considered as unknown cell types.There are far more cells of unknown cell types than C j .So the binary classifier tends to predict that the sample belongs to unknown cell types without β.
To demonstrate the effectiveness of loss L sup_t , we remove it during training.From Table 4, the performance of model drops significantly without L sup_t .without L sup_t , open-set classifier can only identify the cell types in the source data.For example, if source data has cell type C a and C b , but not includes cell type C c which exists in target data.For a binary classifier which identify whether a cell is C a , the binary classifier can identify C b as unknown cell type through supervised learning but it may not be able to distinguish C c for the reason that source data have little information about C c .L sup_t can make the classifier effectively Table 3. Performance comparison of methods on scATAC-seq data  methods, they all retained far more than 20 000 features.We varied the number of features for these methods.Cellcano, however, utilized all peak information, converting the peak matrix into a gene activity matrix.Due to the smaller number of genes, this experiment could not be performed for Cellcano.It can be observed that an increase in the number of features does not necessarily enhance the performance of these methods.On the contrary, it may introduce noise, thereby degrading performance.

ScRNA-seq experiment
We further examined the performance of our model on scRNA data by collecting six datasets (Table 2).For melanoma dataset, following the setting of a previous study [6] 5, OVAAnno outperform other methods in most datasets.In the four datasets tested, the performance of OVAAnno is the top on three datasets and is the third on the other dataset which has small batch effects between training and test sets.For XSMToBaron and BaronToXSM, Baron and XSM are produced by different scRNA-seq sequence technologies, leading to large batch effects between training and test sets.Moreover, XSM consists of three different datasets produced by different scRNA-seq technologies, so there are batch effects in XSM.OVAAnno can accurately identify unknown cell types in these experiments.

Disscussion and conclusion
In summary, we proposed OVAAnno, an automatic cell type annotation method combined with open-set domain adaptation for scATAC-seq data.In this work, not only simply removing batch effects, our method can also isolate unknown samples.Moreover, it can be applied to scRNA-seq data, which suggests that it can be used on single-cell data of different modalities.
Our future work is to extend the method so that it can be applied to multimodal data, such as scATAC data and scRNAseq data.One limitation is that our method shows moderate performance on datasets with small batch effects, the case that one dataset is split into training and test sets.Extending the application of our method to various types of datasets is a direction of improvement.

Key Points
• OVAAnno sets an automatically adjusted threshold to separate known samples and unknown samples.• OVAAnno sets a closed-set classifier which identifies the specific type of known sample to help open-set classifier to identify unknown samples.

Figure 1 .
Figure 1.Basic framework of OVAAnno; the orange line represents the data f low of the target set and the blue line represents the data f low of the source set; the source and target data share all network parameters.

Figure 2 .
Figure 2. Effect of parameter selection; (a and b) the choice of top interval for Pb on model performance.

Table 5 .
Performance comparison of methods on scRNA-seq data Dataset name annotation: #To * (# is training set, * is test set) [20]% of immune cells are randomly sampled as training set and the others as test set, denoted as MTrain and MTest, respectively.Peri batch of Macaque dataset is selected as training set and Fovea is selected as test set.Xin, Segerstolpe, and Muraro are combined into a new dataset XSM.Then following a previous study[20], XSM cells labeled as unclear, co-expression, unclassified, unclassified endocrine, alpha.contaminated,beta.contaminated,delta.contaminated,or gamma.contaminatedareremoved.When using XSM as training set and Baron as test set, those cell types available in XSM but not in Baron are removed, and a similar operation is performed when using XSM as the test set and Baron as the training set.For MTrainToMTest and PeriToFovea datasets, as the training and test sets are from the same source, there are small batch effects.On the other hand, for XSMToBaron and BaronToXSM datasets, the training and test sets are from different sources.The training configs are shown in Tables S6-S9.From Table